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(Ih ' Abstract 



We describe our solution to the Banff 2 challenge problems as well as the out- 
comes. 
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1. Introduction 

In July of 2010 a conference was held on the statistical issues relevant to 
significance of discovery claims at the LHC. The conference took place at the 
Banff International Research Station in Banff, Alberta, Canada. After many 
discussions it was decided to hold a competition to see which methods would 
perform best. One of the participants, Thomas Junk, would create a large 
number of data sets, some with a signal and some without. There were two 
main parts to the competition: 

Problem 1 was essentially designed to see whether the methods could cope 
with the "look-elsewhere" effect, the issue of searching through a mass spectrum 
for a possible signal. 

Problem 2 was concerned with the problem that sometimes there are no 
known distributions for either the backgrounds or the signal and they have to 
be estimated via Monte Carlo. 

For a detailed description of the problems as well as the data sets and a dis- 
cussion of the results see Tom Junk's CDF web page at http://www-c df.fnal.gov/~trj/] 
In this paper we will present a solution based on the likelihood ratio test, and 
discuss the performance of this method in the challenge. 

2. The method 

Our solution for both problems is based on the likelihood ratio test statistic 
A(x) = 2 ('max{logL(6'|x) : 6>}-max{logL(6'|x) : 9 e &°}) 

According to standard theorems in Statistics A(X) often has a x^ distribution 
with the number of degrees of freedom the difference between the number of free 
parameters and the number of free parameters under the null hypothesis. This 
turns out to be true for problem 2 but not for problem 1, in which case the null 
distribution can be found via simulation. 



2.1 


. Problem 1 
Here we have: 
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h{x; a, E) — {1 — a)f{x) + ag{x; E) 

Hq : a — vs Ha : a > 

logL{a, £;|x) = X;r=i log [(1 - o:)fixt) + ag{x,;E)] 

Now max{logL(a, -E|x)} is the log likelihood function evaluated at the maxi- 
mum likelihood estimator and max{logL(a, £'|x) : e ©°} =logL(0, 0|x). Note 
that if a = any choice of E yields the same value of the likelihood function. 



In the following figure we have the histogram of 100000 values of A(x) for 
a simulation with n = 500 and a = together with the densities of the x^ 
distribution with df's from 1 to 5. Clearly non of these yields an acceptable fit. 
Instead we use the simulated data to find the 99% quantile and reject the null 
hypothesis if A(x) is larger than that, shown as the vertical line in the graph. 

In general the critical value will depend on the sample size, but for those in 
the challenge (500 — 1500) it is always about 11.5. 

If it was decided to do discovery using 5a the critical value can be found using 
importance sampling. Recently Eilam Gross and Ofer Vitells have developed an 
analytic upper bound for the tail probabilities of the null distribution, see " Trial 
factors for the look elsewhere effect in high energy physics", Eilam Gross, Ofer 
Vitells, Eur.Phys.J.G70:525-530,2010. Their result agrees with our simulations. 

Finding the mle is a non-trivial exercise because there are many local min- 
ima. The next figure shows the log-likelihood as a function of E with a fixed 
at 0.05 for 4 cases. 

To find the mle we used a two-step procedure: first a fine grid search over 
values of E from —0.015 to 1 in steps of 0.005. At each value of E the corre- 
sponding value of a that maximizes the log-likelihood is found. In a second step 
the procedure starts at the best point found above and uses Newton-Raphson 
to find the overall mle. 

2.2. Problem 2 

Again we want to use: 

/i(x;a,/3) = (1 - a- l3)fi{x) + I3f2{x) + ag{x) 

Hq : a = vs Ha : a > 

log L{a, /3|x) =Er=i log [(1 - a - P)hix) + Phi^) + M^)] 

Now max{logi(Q!, /3|x)} is the log likelihood function evaluated at the maxi- 
mum likelihood estimator and max{ log i(Q!, /3|x) : 6 £ @ } = max{logL(0, /3|x) 

/3}. 

The difficulty is of course that we don't know /i, /2 or g. We have used 
three different ways to find them: 

2.2.1. Parametric Fitting 

Here one tries to find a parametric density that gives a reasonable fit to the 
data. For the data in the challenge this turns out to be very easy. In all three 
cases a Beta density gives a very good fit: 

2.2.2. Nonparametric Fitting: 

There are a variety of methods known in Statistics for non-parametric density 
estimation. The difficulty with the data in the challenge is that it is bounded 
on a finite interval, a very common feature in HEP data. Moreover the slope of 
the density of Background 1 at is infinite. I checked a number of methods and 
eventually ended up using the following: for Background 2, the Signal and the 
right half of Background 1 i bin the data (250 bins) find the counts and scale 
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Figure 1: Histogram of 100000 values of the null distribution, with several fits from chi-square 
distributions and gO"* percentile. 
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Figure 2: Log-Likelihood as a function of signal location E. a=0.05 
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Figure 3: Parametric fits to the three MC samples. 



them to integrate to unity. Then i use the non-parametric density estimator 
loess from R with the defauh span (smoothing parameter). This works well 
except on the left side of Background 1. There the infinite slope of the density 
would require a smoothing parameter that goes to 0. Instead i transform the 
data with log j^j . The resulting data has a density without boundary, which 
i estimate using the routine density from R, again with the default bandwidth. 
This is then back-transfomed to the 0-1 scale. This works well for the left side 
but not the right one, and so i "splice" the two densities together in the middle. 
The resulting densities are shown here: 

2.2.3. Semiparametric Fitting 

It is possible to combine the two approaches above: fit some of the data 
parametrically and others non-parametrically, for example if the signal is known 
to have a Gaussian distribution but the background density is Monte Carlo data. 

For the data in the challenge the three methods give very similar results, so 
i am submitting only the solution using the parametric fits. 

2.2.4. Back to the Test 

What is the null distribution of A(x) now? In the following figure we have 
the histogram of 5000 values of A(x) for a simulation with 500 events from 
Background 1, 100 events from Background 2 and no Signal events, a = 
together with the density of a x^ distribution with 1 df. The densities arc fit 
parametrically. 

Clearly this yield a very good fit, so we will reject the null hypothesis if 
A(x) > (7x^(0.99, 1) = 6.635, the 99*^* percentile of a chi-square distribution 
with 1 degree of freedom. The same result holds if the fitting was done non 
parametrically or semi parametrically. 

2.3. Error Estimation 

We have the following large sample theorem for maximum likelihood esti- 
mators: if we have a sample Xi, .., Xn with density f(x;p) and p is the mle for 
the parameter p, then under some regularity conditions 

V^(p-p)-iV(0,cr) 

where cr^ = — 1 /nE[-j-2 log f{X;p)], the Fisher information number. This can 

be estimated using the observed Fisher information number — X]r=i T^ ^ogh{Xi;p). 
For problem 1 we find: 
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Figure 4: Non-parametric fits to the three MC samples. 
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Figure 5: Histogram of null distribution for problem 2, with chi-square density. 
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so the standard error of a is estimated with 
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For problem 2 the errors are found similarly. 

How good are these errors? As always with large sample theory, there is a 
question whether it works for a specific problem at the available sample sizes. 
Here is the result of a mini MC study: we generate 500 events from the back- 
ground of problem 1 and 4 events from the signal with E — 0.8. This is repeated 
2000 times. The histogram of estimates for the mixing ratio a is shown here: 
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Figure 6: Histogram of estimates for the mixing ratio a. MC generated 500 events from the 
background of problem 1 and 4 events from the signal with E=0.8. This is repeated 2000 
times. 
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Clearly the distribution of the estimates is not Gaussian, and therefore the 
errors are wrong. 

So we need a different method for finding the 68% intervals. This can be done 
via the statistical bootstrap. In our submission we will include intervals based 
on the bootstrap, specifically the 16*'' and the 84*'* percentile of a bootstrap 
sample of size 300. In a real live problem it would be easy to run a mini MC for 
a specific case and then decide which errors to use. Because we have to process 
20000 data sets we can not do so for each individually, and therefore use of the 
bootstrap intervals. 

2.4- Power Studies 

Problem 1:. Wc used the following code to do the power study: 
counter=G; 
for(int k==0;k<lGGGO;++k) { 

nsig=rpois(0.07519885*D,seed); 

for (int i=G; i<nsig; -\ — hi) x[i]=rnorm(E,sced); 

nback=rpois(lGGO,secd); 

for (int i=0; i<nback; H — hi) x[nsig+i]=rbackground(sced) 

lrt=findLRT(x); 

if(lrt>11.5) ++counter; 

} 

power=countcr/M; 

The results are : 
(£>, E) Power 

(1010,0.1) 0.356 

(137,0; 0.5) 0.457 

(18,0; 0,9) 0.184 

Problem 2:. We used the following code to do the power study: 
counter^^O; 
for(int k=0;k<10000;++k) { 

nl=rnorm(l,900,90) 

xl=sample(bc2p2blmc,sizc=nl) 

n2=max(rnorm(l,100,10G),0) 

x2=:sample(bc2p2b2mc,size=n2) 

n3=rpois(75,seed); 

x3=sample(bc2p2sigmc,size=n3) 

nback=rpois(1000,seed); 

x=c(xl,x2,x3) 

lrt=findLRT(x); 

if(lrt>6.635) ++counter; 

} 

power=countcr/M; 

The result is a power of 88%. 
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3. Discussion of Results 

For a detailed discussion of the performance of all the submitted methods 



see Tom Junk's CDF web page at http://www-cdf.fnal. gov/^trj . Here we will 
discuss only the results of our method. 

3.1. Problem 1 

True type I error probability 1.03% 

Missed signals: 53.7% 

So the method achieves the desired type I error probability of 1%. 

How about the errors? For the cases were a signal was claimed correctly the 
nominal 68% CI included the true number of signal events 59% of the time and 
the true signal location 63%, somewhat lower than desired. This is unexpected 
because these errors were tested via simulation. For example, for one of the 
cases in the data set, E — 0.38 and 40 signal events, the true error rates are 
86.7% for the number of signal events and 67.5% for the signal location. It turns 
out, though, that the error estimates are quite bad in cases were the signal rate 
is very low. For example for the case E = 0.38 and 20 signal events the true 
error rates are 53.6% for the number of signal events and 40.2% for the signal 
location. 

We also checked the errors based on the Fisher Information as described 
above. Their performance was comparable to the bootstrap errors. The conclu- 
sion is that for cases were the signal is small error estimates are difficult. 

3.2. Problem 2 

True type I error probability 2.56% 
Missed signals: 22.5% 

So for this problem the type I error probability is too large. The reason turns 
out to be the parametric fit. We used Beta densities for all components, specif- 
ically i3eta(0.4, 1) for background 1, i3eta(l,l)(= t/m/(0, 1)) for background 
2 and Beta{A.75, 1) for the signal. These give excellent fits to the MC data 
provided. For example, generating 5000 random variates from a i?eia(0.4, 1), 
running the Kolmogorov-Smirnofftest and repeating this procedure many times 
rejects the null hypothesis of equal distributions only about 7% of the time, at 
a nominal 5% rate. This problem was caused by the size of the MC samples. If 
instead of 5000 MC events there had been 50000 the same procedure as above 
would have rejected the null hypothesis of equal distributions 98% of the time, 
and a better fitting density would have to be found. The real conclusion here 
is that a very good density estimate is required to make this test work. 
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